BNSI Slovenian broadcast news database - speech and text corpus
Abstract
This paper presents the BNSI Slovenian Broadcast News database project. The result of the project is a database comprising a speech corpus and a text corpus, oriented toward general-domain large vocabulary continuous speech recognition. The speech corpus consists of 36 hours of transcribed evening and late-night news. The raw material was captured in the archive of the national broadcaster RTV Slovenia, which was a partner in the project. General Broadcast News transcription conventions were supplemented with language-specific rules. The Transcriber tool was used to produce the transcriptions. All additional tools needed during the annotation process were also installed on a computer. Statistics of the speech corpus are presented in the paper. The BNSI text corpus was generated from broadcast scenarios covering a period of 7 years; 600 monthly collections of the shows' texts are included. They will be used to improve language modeling for the highly inflectional Slovenian language. The BNSI Slovenian Broadcast News database will be available through ELRA/ELDA.
Similar resources
SINOD - Slovenian non-native speech database
This paper presents the SINOD database, which is the first Slovenian non-native speech database. It will be used to improve the performance of a large vocabulary continuous speech recogniser for non-native speakers. The main quality impact is expected for the acoustic models and the recogniser’s vocabulary. The SINOD database is designed as a supplement to the Slovenian BNSI Broadcast News database. The sa...
The Slovene BNSI Broadcast News database and reference speech corpus GOS: Towards the uniform guidelines for future work
The aim of the paper is to search for common guidelines for the future development of speech databases for less resourced languages in order to make them the most useful for both main fields of their use, linguistic research and speech technologies. We compare two standards for creating speech databases, one followed when developing the Slovene speech database for automatic speech recognition –...
Two step speaker segmentation method using Bayesian information criterion and adapted Gaussian mixtures models
This paper addresses the topic of online unsupervised speaker segmentation in a complex audio environment as it is present in the Broadcast News databases. A new two stage speaker change detection algorithm is proposed, which combines the Bayesian Information Criterion with an ABLS-SCD statistical framework where adapted Gaussian mixture models are used to achieve higher accuracy. To enhance th...
Development of Slovenian Broadcast News Speech Database
The paper reviews the development of a new Slovenian broadcast news speech database. The database consists of audio, video and annotation transcripts of about 34 hours of television daily news program captured from the public TV station RTVSLO. The paper addresses issues concerning transcription and annotation of the collected data, provides information on content analysis and basic statistics ...
Acquisition and Annotation of Slovenian Broadcast News Database
This paper presents the Slovenian Broadcast News Database project, which was started in 2002 as a cooperation between the University of Maribor and the Slovenian national broadcaster RTV Slovenia. The resulting database will be used for large vocabulary continuous speech recognition and multimedia database retrieval or archive indexation. First some organizational aspects that were needed in initial p...